RAPToR - ShowcaseThis vignette showcases different uses for RAPToR.
The data in this example is a C. elegans microarray
profiling experiment by Lehrbach et al.
(2012), hereafter called dslehrbach2012 (Accession :
E-MTAB-1333).
The experimental setup is the following. Two C. elegans
strains : wild-type (wt) and pash-1(mj100)
(mut) were profiled at 0, 6, 12 and 24 hours past the L4
stage (~ 50 hours post-hatching at 20°C). Each condition was profiled in
triplicates.
Here, finding Differential Gene Expression (DE) between strains relies on the inclusion of the effect of development. This is true in any time series context, even more so in this case because late-larval development of C. elegans is known for drastic changes in gene expression within a short time frame (snoek2014rapid).
Code to generate dslehrbach2012 :
Note : set the data_folder variable to an
existing path on your system where you want to store the
objects.
data_folder <- "../inst/extdata/"
library("biomaRt") # May need to be installed with bioconductor
library("limma") # ..
library("affy") # ..
library("gcrma") # ..
requireNamespace("wormRef", quietly = T)
requireNamespace("utils", quietly = T)
dslehrbach202# get probe ids from the arrayexpress
probe_ids <- read.table("https://www.ebi.ac.uk/arrayexpress/files/A-AFFY-60/A-AFFY-60.adf.txt", h=T,
comment.char = "", sep = "\t", skip = 17, as.is = T)
# pheno data
p_url <- "https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-1333/E-MTAB-1333.sdrf.txt"
p <- read.table(p_url, h = T, sep = "\t", comment.char = "", as.is = T)
# geno data
g_url <- unique(p$Comment..ArrayExpress.FTP.file.)
g_dir <- paste0(data_folder, "dslehrbach2012_raw/")
g_file <- paste0(g_dir, "raw.zip")
if(!file.exists(g_file)){
dir.create(g_dir)
utils::download.file(g_url, destfile = g_file)
}
utils::unzip(g_file, exdir = g_dir)
g <- affy::ReadAffy(filenames = p$Array.Data.File, celfile.path = g_dir, phenoData = p)
g <- affy::expresso(g, bg.correct = F, normalize = F,
pmcorrect.method = "pmonly", summary.method = "median")
g <- 2^exprs(g) # expresso log2s the data
colnames(g) <- p$Source.Name
# format ids
g <- RAPToR::format_ids(g, probe_ids, from = 1, to = 7)
g <- RAPToR::format_ids(g, wormRef::Cel_genes, from = 3, to = 1)
# filter relevant fields
p <- p[, c(1,22,24)]
colnames(p) <- c("title", "tpastL4", "strain")
p$strain <- factor(p$strain, levels = c('wild type', 'pash-1(mj100)'), labels = c('wt', 'strain'))
p$rep <- factor(gsub("\\w+_h\\d+\\.(\\d)","\\1", p$title))
dslehrbach2012 <- list(g = g, p = p)
save(dslehrbach2012, file = paste0(data_folder, "dslerhbach2012.RData"), compress = "xz")
# cleanup
file.remove(g_file)
unlink(g_dir, recursive = T)
rm(g, g_url, g_dir, g_file, p, p_url, probe_ids)
dslehrbach2012$g <- limma::normalizeBetweenArrays(dslehrbach2012$g, method = "quantile")
dslehrbach2012$g <- log1p(dslehrbach2012$g)
The sample (chronological) ages range from the fourth larval molt
(L4) to 24h past L4. The Cel_YA_2 reference from
wormRef covers this range, so we’ll select it to stage the
samples.
r_ya <- prepare_refdata("Cel_YA_2", datapkg = "wormRef", n.inter = 200)
ae_dslb <- ae(dslehrbach2012$g, r_ya)
We can see quite a lot of variation in the sample timings, especially at the 0 and 6 hour timepoints. Another interesting effect can be seen when grouping the samples by replicate.
The first two replicates have the mut samples
consistently older than wt, while the opposite is true for
the third replicate. This effect is statistically significant.
age_dif <- ae_dslb$age.estimates[dslehrbach2012$p$strain=="wt",1] -
ae_dslb$age.estimates[dslehrbach2012$p$strain!="wt",1]
reps <- dslehrbach2012$p$rep[dslehrbach2012$p$strain=="wt"]
summary(lm(age_dif~reps))$coefficients
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.216851 0.778402 -2.847949 0.0191531 *
#> reps2 -0.936003 1.100827 -0.850272 0.4172190
#> reps3 3.497698 1.100827 3.177335 0.0112327 *
cc <- cor(dslehrbach2012$g, dslehrbach2012$g, method = "spearman")
By comparing the correlation between sample pairs to their age difference, we can show how development impacts overall gene expression in the data.
The graph below shows this, with each point corresponding to a pair of samples and color-coding whether both samples are of the same strain or not.
We see the strain does not impact correlation between samples, in comparison to developmental differences. This is also visible through correlation heatmaps ordered by strain and chronological age, or estimated age.
We can perform a simple DE analysis with limma, using
the following model :
\[X \sim strain \times \texttt{ns(age, df = 2)} \]
Non-linear dynamics in the data are taken care of by a spline with \(\texttt{ns()}\).
We’ll look at the number of DE genes due to development or strain, which translates to comparing the following nested models (2 vs. 1 for development, 3 vs. 1 for strain).
\[ \begin{align} Y & = \beta_0 + \beta_1 I_{strain} + (\alpha_1 age_{sp1} + \alpha_2 age_{sp2}) + (\gamma_1 I_{strain} age_{sp1} + \gamma_2 I_{strain} age_{sp2})\\ Y & = \beta_0 + \beta_1 I_{strain}\\ Y & = \beta_0 + (\alpha_1 age_{sp1} + \alpha_2 age_{sp2}) \end{align} \]
Note that we don’t consider whether genes are up- or down-regulated, but only the detection of an effect. Genes will be considered DE with Benjamini-Holm adjusted p.values of coefficients below \(0.05\).
DE code :
# Predictions from limma model
pred_lmFit <- function(Fit){
tcrossprod(Fit$coefficients, Fit$design)
}
# Compute Goodness of Fit
GoF <- function(Fit, X){
pred <- pred_lmFit(Fit)
res <- (X - pred)
ss <- apply(X, 1, function(ro) sum((ro - mean(ro))^2))
Rsq <- sapply(seq_len(nrow(X)), function(i){
1 - sum(res[i,]^2)/ss[i]
})
return(Rsq)
}
DGE <- function(X, strain, age, df = 2, name = NULL, return.model = FALSE){
require(splines)
if(! length(strain) == ncol(X) | ! length(age) == ncol(X))
stop("strain and age must be of length ncol(X).")
# make pdat df
pdat <- data.frame(strain = factor(strain), age = as.numeric(age), row.names = colnames(X))
# build design matrix
d <- model.matrix(~ 1 + ns(age, df = df) * strain, data = pdat)
# fix colnames
colnames(d) <- c("b0", paste0(rep("a", df), 1L:df), "strainmut", paste0(rep("strainmut.a", df), 1L:df))
# build contrast matrices for mut and age tests
if(df == 2){
cm.mut <- makeContrasts(mut = strainmut, #strainwt - strainmut,
mut.i1 = strainmut.a1,
mut.i2 = strainmut.a2,
levels = d)
cm.age <- makeContrasts(a1, a2,
a.i1 = strainmut.a1,
a.i2 = strainmut.a2,
levels = d)
}
if(df == 3){
cm.mut <- makeContrasts(mut = strainmut, #strainwt - strainmut,
mut.i1 = strainmut.a1,
mut.i2 = strainmut.a2,
mut.i3 = strainmut.a3,
levels = d)
cm.age <- makeContrasts(a1, a2, a3,
a.i1 = strainmut.a1,
a.i2 = strainmut.a2,
a.i3 = strainmut.a3,
levels = d)
}
# fit model
m.0 <- lmFit(object = X, design = d)
# get GoF
gof <- GoF(Fit = m.0, X = X)
# find DE genes for mut
m.m <- contrasts.fit(m.0, contrasts = cm.mut)
m.m <- eBayes(m.m)
# find DE genes for age
m.a <- contrasts.fit(m.0, contrasts = cm.age)
m.a <- eBayes(m.a)
Tmut <- topTable(m.m, adjust.method = "BH", number = Inf, sort.by = "none")[, c("F", "P.Value", "adj.P.Val")]
Tage <- topTable(m.a, adjust.method = "BH", number = Inf, sort.by = "none")[, c("F", "P.Value", "adj.P.Val")]
res <- list(gof = gof,
tmut = Tmut,
tage = Tage,
name = name)
if(return.model)
res$model = m.0
rm(m.0, m.m, m.a, Tmut, Tage, d, X, pdat, gof)
gc(verbose = F)
return(res)
}
To determine whether there is an improvement in DE analysis when using age estimates or chronological age, we use the same model with either predictor and compare results.
# format gdata
X <- log2(exp(dslehrbach2012$g))
dge.ca <- DGE(X = X, strain = dslehrbach2012$p$strain, age = dslehrbach2012$p$tpastL4,
name = "dge.ca", return.model = T)
dge.ae <- DGE(X = X, strain = dslehrbach2012$p$strain, age = ae_dslb$age.estimates[,1],
name = "dge.ae", return.model = T)
Below, we show adjusted p.values for detection of an effect for each gene, when using chronological or estimate age in the model. Red bars correspond to the 0.05 threshold for significance and gene counts for each quadrant are color-matched.
mut_dif.ca <- sum(dge.ca$tmut$adj.P.Val < 0.05)
mut_dif.ae <- sum(dge.ae$tmut$adj.P.Val < 0.05)
age_dif.ca <- sum(dge.ca$tage$adj.P.Val < 0.05)
age_dif.ae <- sum(dge.ae$tage$adj.P.Val < 0.05)
100 * (mut_dif.ae - mut_dif.ca) / (mut_dif.ca) # mut pct increase
#> [1] 96.39528
100 * (age_dif.ae - age_dif.ca) / (age_dif.ca) # age pct increase
#> [1] 7.674787
We detect nearly twice as many DE genes for strain using our estimates compared to using the chronological age, and around \(8\%\) more genes with development. This, along with the large proportion of genes changing with development (\(57 - 62\%\)) compared to strain (\(7-16\%\)) shows how crucial it is to properly include developmental dynamics in DE analyses.
This increase in DE genes is at least partially explained by better model fits. Below, we show the fit distribution across genes, The goodness-of-fit (GoF) computed is an \(R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\) per gene.
If the asynchronicity we observe in age estimates is erroneous, then using random noise around the chronological age values should yield similar results to using our age estimates in the model. To test this, we can simulate age sets with added noise of similar distribution to the age differences observed between chronological and estimated age.
As done above, we compare results with the model goodness-of-fit (GoF) per gene: an \(R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\).
We also look at the number of DE genes found for strain and development (defined as those with a BH-adjusted p.value below 0.05).
age_diffs <- (dslehrbach2012$p$tpastL4 + 50) - ae_dslb$age.estimates[,1]
# Note : + 50 to shift tpastL4 to age, avoiding generation of negative tpastl4 values,
# this has no impact on the DGE analysis (it just shifts the age window).
# estimate density function of age_diffs
d_ad <- density(age_diffs)
# generate age sets of with random age_diffs-like noise
set.seed(10)
n <- 100
rd_ages <- lapply(seq_len(n), function(i){
(dslehrbach2012$p$tpastL4 + 50) + sample(x = d_ad$x, size = nrow(dslehrbach2012$p), prob = d_ad$y, replace = T)
})
# setup cluster for parallelization
cl <- makeCluster(3, "FORK")
# do DGE on all age sets
rd_dges <- parLapply(cl, seq_len(n), function(i){
cat("\r", i,"/",n)
DGE(X, dslehrbach2012$p$strain, rd_ages[[i]], name = paste0("rd.",i))
})
stopCluster(cl)
gc()
We can see that using the estimated age leads to better model fits and increased detection of DE genes both for strain and along development.
Below, GoF distributions ordered by chronological age quantiles (binned along the quantile scale) and shown as boxplots also clearly show that for the same genes, we tend to have better fits with estimated age than chronological age, while that is not the case when adding noise.
Profiled samples are not always from well-studied organisms, with an abundance of reference time-series data available. However, referring to the closest model organism as a reference can still yield proper age estimates, thanks to the conserved nature of developmental processes across species.
Indeed, samples can be staged cross-species using ortholog genes.
We’ll be staging C. elegans single embryos on a D. melanogaster reference using the following data:
dslevin2016cel. (Accession : GSE60755)dsgraveley2011. (Data downloaded from fruitfly.org)We’ll use a set of orthologs between D. melanogaster and
C. elegans, established by Li et al.
(2014). This list will be stored in the glist
object.
To avoid confusion between datasets, in the plots below, the D.melanogaster samples will be in red and the C. elegans samples will be in blue
Code to generate glist, dslevin2016cel and
dsgraveley2011 :
Note : set the data_folder variable to an
existing path on your system where you want to store the
objects.
data_folder <- "../inst/extdata/"
requireNamespace("wormRef", quietly = T)
requireNamespace("utils", quietly = T)
requireNamespace("GEOquery", quietly = T) # May need to be installed with bioconductor
requireNamespace("Biobase", quietly = T)
raw2tpm <- function(rawcounts, genelengths){
if(nrow(rawcounts) != length(genelengths))
stop("genelengths must match nrow(rawcounts).")
x <- rawcounts/genelengths
return(t( t(x) * 1e6 / colSums(x) ))
}
fpkm2tpm <- function(fpkm){
return(exp(log(fpkm) - log(colSums(fpkm)) + log(1e6)))
}
requireNamespace("biomaRt", quietly = TRUE)
mart <- biomaRt::useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
droso_genes <- biomaRt::getBM(attributes = c("ensembl_gene_id",
"ensembl_transcript_id",
"external_gene_name",
"transcript_end", "transcript_start"),
mart = mart)
droso_genes$transcript_length <- droso_genes$transcript_end - droso_genes$transcript_start
droso_genes <- droso_genes[,c(1:3,6)]
colnames(droso_genes)[1:3] <- c("fb_id", "transcript_id", "gene_name")
rm(mart)
glist
Get list of ortholog genes between C. elegans and D. melanogaster from Li et al. (2014) supplementary table 1.
tmp_file <- paste0(data_folder, "dmel_cel_orth.zip")
tmp_fold <- paste0(data_folder, "dmel_cel_orth/")
f_url <- "https://genome.cshlp.org/content/suppl/2014/05/15/gr.170100.113.DC1/Supplemental_Files.zip"
utils::download.file(url = f_url, destfile = tmp_file)
utils::unzip(tmp_file, exdir = tmp_fold)
glist <- read.table(paste0(tmp_fold, "Supplementary\ files/TableS1\ fly-worm\ ortholog\ pairs.txt"),
skip = 1, h=T, sep = "\t", as.is = T, quote = "\"")
colnames(glist) <- c("fb_id", "dmel_name", "cel_id", "cel_name")
glist$wb_id <- wormRef::Cel_genes[match(glist$cel_id, wormRef::Cel_genes$sequence_name), "wb_id"]
save(glist, file = paste0(data_folder, "sc2_glist.RData"), compress = "xz")
file.remove(tmp_file)
unlink(tmp_fold, recursive = T)
rm(tmp_file, tmp_fold, f_url)
dslevin2016cel
geo_dslevin2016cel <- "GSE60755"
g_url_dslevin2016cel <- GEOquery::getGEOSuppFiles(geo_dslevin2016cel, makeDirectory = FALSE, fetch_files = FALSE)
g_file_dslevin2016cel <- paste0(data_folder, "dslevin2016cel.txt.gz")
utils::download.file(url = as.character(g_url_dslevin2016cel$url[1]), destfile = g_file_dslevin2016cel)
X_dslevin2016cel <- read.table(gzfile(g_file_dslevin2016cel), h = T, sep = '\t', as.is = T, row.names = 1, comment.char = "")
# filter poor quality samples
cm_dslevin2016cel <- RAPToR::cor.gene_expr(X_dslevin2016cel, X_dslevin2016cel)
f_dslevin2016cel <- which(0.67 > apply(cm_dslevin2016cel, 1, quantile, probs = .99))
X_dslevin2016cel <- X_dslevin2016cel[, -f_dslevin2016cel]
# convert to tpm & FBgn
X_dslevin2016cel <- X_dslevin2016cel[rownames(X_dslevin2016cel)%in%wormRef::Cel_genes$sequence_name,]
X_dslevin2016cel <- raw2tpm(rawcounts = X_dslevin2016cel,
genelengths = wormRef::Cel_genes$transcript_length[match(rownames(X_dslevin2016cel),
wormRef::Cel_genes$sequence_name)])
X_dslevin2016cel <- RAPToR::format_ids(X_dslevin2016cel, wormRef::Cel_genes, from = "sequence_name", to = "wb_id")
# pheno data
P_dslevin2016cel <- Biobase::pData(GEOquery::getGEO(geo_dslevin2016cel, getGPL = F)[[1]])
# filter relevant fields/samples
P_dslevin2016cel <- P_dslevin2016cel[, c("title", "geo_accession", "time point (minutes after 4-cell):ch1")]
colnames(P_dslevin2016cel)[3] <- "time"
P_dslevin2016cel$title <- as.character(P_dslevin2016cel$title)
P_dslevin2016cel <- P_dslevin2016cel[P_dslevin2016cel$title %in% colnames(X_dslevin2016cel),]
# formatting
P_dslevin2016cel$age <- as.numeric(P_dslevin2016cel$time) / 60
P_dslevin2016cel <- P_dslevin2016cel[order(P_dslevin2016cel$age),]
X_dslevin2016cel <- X_dslevin2016cel[, P_dslevin2016cel$title]
P_dslevin2016cel$title <- gsub('Metazome_CE_timecourse_', '', P_dslevin2016cel$title)
colnames(X_dslevin2016cel) <- P_dslevin2016cel$title
# remove extra outlier
X_dslevin2016cel <- X_dslevin2016cel[,-127]
P_dslevin2016cel <- P_dslevin2016cel[-127,]
dslevin2016cel <- list(g = X_dslevin2016cel, p = P_dslevin2016cel)
save(dslevin2016cel, file = paste0(data_folder, "dslevin2016cel.RData"), compress = "xz")
# cleanup
file.remove(g_file_dslevin2016cel)
rm(geo_dslevin2016cel, g_url_dslevin2016cel, g_file_dslevin2016cel,
f_dslevin2016cel, cm_dslevin2016cel, X_dslevin2016cel, P_dslevin2016cel)
dsgraveley2011
g_url_dsgraveley2011 <- "ftp://ftp.fruitfly.org/pub/download/modencode_expression_scores/Celniker_Drosophila_Annotation_20120616_1428_allsamps_MEAN_gene_expression.csv.gz"
g_file_dsgraveley2011 <- paste0(data_folder, "dsgraveley2011.csv.gz")
utils::download.file(g_url_dsgraveley2011, destfile = g_file_dsgraveley2011)
X_dsgraveley2011 <- read.table(gzfile(g_file_dsgraveley2011), sep = ',', row.names = 1, h = T)
# convert gene ids to FBgn
X_dsgraveley2011 <- RAPToR::format_ids(X_dsgraveley2011, droso_genes, from = "gene_name", to = "fb_id")
# select embryo time series samples
X_dsgraveley2011 <- X_dsgraveley2011[,1:12]
P_dsgraveley2011 <- data.frame(sname = colnames(X_dsgraveley2011),
age = as.numeric(gsub("em(\\d+)\\.\\d+hr", "\\1", colnames(X_dsgraveley2011))),
stringsAsFactors = FALSE)
dsgraveley2011 <- list(g = X_dsgraveley2011, p = P_dsgraveley2011)
save(dsgraveley2011, file = paste0(data_folder, "dsgraveley2011.RData"), compress = "xz")
# cleanup
file.remove(g_file_dsgraveley2011)
rm(g_url_dsgraveley2011, g_file_dsgraveley2011, X_dsgraveley2011, P_dsgraveley2011)
dsgraveley2011$g <- limma::normalizeBetweenArrays(dsgraveley2011$g, method = "quantile")
dsgraveley2011$g <- log1p(dsgraveley2011$g)
dslevin2016cel$g <- limma::normalizeBetweenArrays(dslevin2016cel$g, method = "quantile")
dslevin2016cel$g <- log1p(dslevin2016cel$g)
# outlier samples (see below)
rem <- c(sample_0001 = 53L, sample_0002 = 54L, sample_0003 = 58L, sample_0004 = 59L)
We use dsgraveley2011 for the Dme_embryo
reference of the drosoRef package, so the reference can be
loaded as follows.
# reference built from dsgraveley2011
r_grav <- prepare_refdata("Dme_embryo", "drosoRef", 500)
We then convert the C. elegans gene IDs to their D. melanogaster orthologs, and stage the samples using the Drosophila reference.
dslevin2016cel$g_dmel <- format_ids(dslevin2016cel$g, glist, from = "wb_id", to = "fb_id")
ae_cel_on_dmel <- ae(dslevin2016cel$g_dmel, r_grav)
We notice gaps in the staging results, likely at timings where there are incompatible expression dynamics between the two species.
By re-building the Drosophila reference on the first 2 components, we keep only broad or monotonic dynamics of development:
# Adjusted reference (same model & data, restraining interpolation to 2 components instead of 8)
m_grav2 <- ge_im(X = dsgraveley2011$g, p = dsgraveley2011$p, formula = "X ~ s(age, bs = 'cr')", nc = 2)
r_grav2 <- make_ref(m_grav2, n.inter = 500,
t.unit = "h past egg-laying",
metadata = list("organism"="D. melanogaster"))
We then stage the C. elegans samples on this adjusted reference.
ae_cel_on_dmel2 <- ae(dslevin2016cel$g_dmel, r_grav2)
From the above analyses, we have removed 4 outlier samples with erroneous chronological age. This is evidenced by their position on principal components, where they appear to be “older” than their specified age.
pca_cel <- stats::prcomp(t(dslevin2016cel$g), rank = 10,
center = TRUE, scale = FALSE)
This is further confirmed by their estimated age, even on the
Drosophila reference:
Tissues can develop at different rates in some species, and these rates can vary between individuals. In C. elegans, this developmental heterochrony has been shown between soma and germline (Perez et al. (2017)). By giving genes associated with or expressed in specific tissues to RAPToR for staging, it is possible to estimate development specific to these tissues.
Rockman, Skrovanek, and Kruglyak (2010)
published microarray profiling of 208 recombinant inbred lines of C.
elegans N2 and Hawaii (CB4856) strains. These 208 samples were
described as “developmentally synchronized” in the original
paper. However, Francesconi and Lehner
(2014) later showed there is significant developmental spread
between samples, spanning around 20 hours of 20°C late larval
development, essentially making this data a very high-resolution time
course hereafter called dsrockman2010.
We build a germline set of 2554 genes by combining the
germline_intrinsic,
germline_oogenesis_enriched, and
germline_sperm_enriched categories defined in Perez et al. (2017). We also define a
soma set of 2718 genes from the osc (molting)
gene category defined in Hendriks et al.
(2014).
Code to generate dsrockman2010,
francesconi_time and gsubset.
Note : set the data_folder variable to an
existing path on your system where you want to store the
objects.
data_folder <- "../inst/extdata/"
requireNamespace("wormRef", quietly = T)
requireNamespace("utils", quietly = T)
requireNamespace("GEOquery", quietly = T) # May need to be installed with bioconductor
requireNamespace("Biobase", quietly = T)
requireNamespace("limma", quietly = T)
geo_dsrockman2010 <- "GSE23857"
geo_dsrockman2010 <- GEOquery::getGEO(geo_dsrockman2010, GSEMatrix = F)
# get pdata
P_dsrockman2010 <- do.call(rbind, lapply(GEOquery::GSMList(geo_dsrockman2010), function(go){
unlist(GEOquery::Meta(go)[
c("geo_accession", "characteristics_ch1", "characteristics_ch2",
"label_ch1", "label_ch2")]
)
}))
P_dsrockman2010 <- as.data.frame(P_dsrockman2010, stringsAsFactors = F)
# get RG data from microarray
RG_dsrockman2010 <- list(
R = do.call(cbind, lapply(GEOquery::GSMList(geo_dsrockman2010), function(go){
GEOquery::Table(go)[, "R_MEAN_SIGNAL"]
})),
G = do.call(cbind, lapply(GEOquery::GSMList(geo_dsrockman2010), function(go){
GEOquery::Table(go)[, "G_MEAN_SIGNAL"]
}))
)
# normalize within channels
RG_dsrockman2010 <- limma::normalizeWithinArrays(RG_dsrockman2010, method = "loess")
RG_dsrockman2010 <- limma::RG.MA(RG_dsrockman2010) # convert back from MA to RG
# get only the RIALs (not the mixed stage controls)
X_dsrockman2010 <- RG_dsrockman2010$R
X_dsrockman2010[, P_dsrockman2010$label_ch1 == "Cy5"] <- RG_dsrockman2010$G[, P_dsrockman2010$label_ch1 == "Cy5"]
# format probe/gene ids
gpl <- GEOquery::getGEO(GEOquery::Meta(GEOquery::GSMList(geo_dsrockman2010)[[1]])$platform_id)
gpl <- GEOquery::Table(gpl)
gpl <- gpl[as.character(gpl$ID) %in% as.character(GEOquery::Table(GEOquery::GSMList(geo_dsrockman2010)[[1]])$ID_REF), ]
sel <- gpl$ORF%in%wormRef::Cel_genes$sequence_name
gpl <- gpl[sel,]
X_dsrockman2010 <- X_dsrockman2010[sel,]
# filter bad quality samples
cm_dsrockman2010 <- cor(log1p(X_dsrockman2010), method = 'spearman')
f_dsrockman2010 <- which(0.95 > apply(cm_dsrockman2010, 1, quantile, probs = .95))
X_dsrockman2010 <- X_dsrockman2010[,-f_dsrockman2010]
P_dsrockman2010 <- P_dsrockman2010[-f_dsrockman2010,]
rownames(X_dsrockman2010) <- gpl$ORF
X_dsrockman2010 <- RAPToR::format_ids(X_dsrockman2010, wormRef::Cel_genes,
from = "sequence_name", to = "wb_id")
dsrockman2010 <- list(g = X_dsrockman2010, p = P_dsrockman2010)
save(dsrockman2010, file = paste0(data_folder, "dsrockman2010.RData"), compress = "xz")
rm(geo_dsrockman2010, gpl, sel, RG_dsrockman2010, X_dsrockman2010, P_dsrockman2010,
cm_dsrockman2010, f_dsrockman2010)
library(readxl)
germline_url <- "https://static-content.springer.com/esm/art%3A10.1038%2Fnature25012/MediaObjects/41586_2017_BFnature25012_MOESM3_ESM.xlsx"
germline_file <- paste0(data_folder, "germline_gset.xlsx")
utils::download.file(url = germline_url, destfile = germline_file)
germline_set <- read_xlsx(germline_file, sheet = 3, na = "NA")[,c(1, 44:46)]
germline_set[is.na(germline_set)] <- FALSE
germline_set <- cbind(wb_id = germline_set[,1],
germline = apply(germline_set[, 2:4], 1, function(r) any(r)),
germline_set[, 2:4])
# germline_set$germline[is.na(germline_set$germline)] <- FALSE
germline <- germline_set[germline_set$germline,1]
germline_intrinsic <- germline_set[germline_set$germline_intrinsic,1]
germline_oogenesis <- germline_set[germline_set$germline_oogenesis_enriched,1]
germline_sperm <- germline_set[germline_set$germline_sperm_enriched,1]
soma_url <- "https://ars.els-cdn.com/content/image/1-s2.0-S1097276513009039-mmc2.xlsx"
soma_file <- paste0(data_folder, "soma_gset.xlsx")
utils::download.file(url = soma_url, destfile = soma_file)
soma_set <- read_xlsx(soma_file, skip = 3, na = "NA")[,c(1, 4)]
soma_set$class <- factor(soma_set$class)
soma_set$soma <- soma_set$class == "osc"
soma_set <- soma_set[soma_set$soma, 1]
gsubset <- list(germline = germline, soma = soma_set$`Gene WB ID`,
germline_intrinsic = germline_intrinsic,
germline_oogenesis = germline_oogenesis,
germline_sperm = germline_sperm)
save(gsubset, file = paste0(data_folder, "sc3_gsubset.RData"), compress = "xz")
file.remove(germline_file)
file.remove(soma_file)
rm(germline_url, germline_file, germline_set, soma_url, soma_file, soma_set)
# Copied from supp data of Francesconi & Lehner (2014)
francesconi_time <- data.frame(time =
c(4.862660944, 4.957081545, 5.051502146, 5.145922747, 5.240343348, 5.334763948, 5.429184549, 5.52360515,
5.618025751, 5.712446352, 5.806866953, 5.901287554, 5.995708155, 6.090128755, 6.184549356, 6.278969957,
6.373390558, 6.467811159, 6.56223176, 6.656652361, 6.751072961, 6.845493562, 6.939914163, 7.034334764,
7.128755365, 7.223175966, 7.317596567, 7.412017167, 7.506437768, 7.600858369, 7.69527897, 7.789699571,
7.884120172, 7.978540773, 8.072961373, 8.167381974, 8.261802575, 8.356223176, 8.450643777, 8.545064378,
8.639484979, 8.733905579, 8.82832618, 8.82832618, 8.82832618, 8.875536481, 8.875536481, 8.875536481,
8.875536481, 8.875536481, 8.875536481, 8.875536481, 8.875536481, 8.875536481, 8.969957082, 9.017167382,
9.017167382, 9.064377682, 9.064377682, 9.111587983, 9.206008584, 9.206008584, 9.206008584, 9.300429185,
9.394849785, 9.489270386, 9.489270386, 9.489270386, 9.489270386, 9.489270386, 9.583690987, 9.583690987,
9.583690987, 9.583690987, 9.583690987, 9.583690987, 9.583690987, 9.583690987, 9.630901288, 9.725321888,
9.819742489, 9.819742489, 9.819742489, 9.819742489, 9.819742489, 9.819742489, 9.819742489, 9.91416309,
10.00858369, 10.05579399, 10.05579399, 10.05579399, 10.05579399, 10.10300429, 10.19742489, 10.19742489,
10.29184549, 10.29184549, 10.29184549, 10.38626609, 10.38626609, 10.38626609, 10.43347639, 10.43347639,
10.43347639, 10.43347639, 10.43347639, 10.43347639, 10.43347639, 10.43347639, 10.43347639, 10.43347639,
10.43347639, 10.43347639, 10.527897, 10.6223176, 10.6223176, 10.6223176, 10.6223176, 10.6223176, 10.6223176,
10.6223176, 10.6695279, 10.6695279, 10.6695279, 10.7639485, 10.7639485, 10.7639485, 10.8583691, 10.8583691,
10.9527897, 10.9527897, 10.9527897, 11.0472103, 11.1416309, 11.2360515, 11.2360515, 11.3304721, 11.3304721,
11.3776824, 11.472103, 11.56652361, 11.66094421, 11.75536481, 11.84978541, 11.94420601, 12.03862661, 12.13304721,
12.22746781, 12.32188841, 12.41630901, 12.51072961, 12.60515021, 12.69957082, 12.79399142, 12.88841202, 12.98283262,
13.07725322, 13.17167382, 13.26609442, 13.36051502, 13.45493562, 13.54935622, 13.54935622, 13.54935622, 13.54935622,
13.54935622, 13.59656652, 13.69098712, 13.78540773, 13.78540773, 13.78540773, 13.87982833, 13.97424893, 14.06866953,
14.06866953, 14.06866953, 14.16309013, 14.25751073, 14.35193133, 14.44635193, 14.54077253, 14.63519313, 14.72961373,
14.82403433, 14.82403433, 14.82403433, 14.91845494, 14.96566524, 15.01287554, 15.10729614, 15.20171674, 15.29613734,
15.39055794, 15.48497854, 15.57939914, 15.67381974, 15.76824034, 15.86266094, 15.95708155, 16.05150215, 16.14592275,
16.24034335, 16.33476395, 16.42918455, 16.52360515),
geo_accession =
c("GSM588291", "GSM588174", "GSM588110", "GSM588097", "GSM588271", "GSM588203", "GSM588105", "GSM588200", "GSM588123",
"GSM588122", "GSM588115", "GSM588100", "GSM588171", "GSM588190", "GSM588229", "GSM588206", "GSM588277", "GSM588129",
"GSM588175", "GSM588151", "GSM588273", "GSM588216", "GSM588099", "GSM588117", "GSM588179", "GSM588164", "GSM588184",
"GSM588092", "GSM588285", "GSM588272", "GSM588228", "GSM588121", "GSM588170", "GSM588194", "GSM588143", "GSM588149",
"GSM588156", "GSM588220", "GSM588212", "GSM588089", "GSM588209", "GSM588253", "GSM588091", "GSM588113", "GSM588130",
"GSM588202", "GSM588191", "GSM588244", "GSM588227", "GSM588197", "GSM588233", "GSM588292", "GSM588163", "GSM588196",
"GSM588224", "GSM588283", "GSM588267", "GSM588257", "GSM588221", "GSM588274", "GSM588090", "GSM588114", "GSM588195",
"GSM588265", "GSM588182", "GSM588093", "GSM588157", "GSM588251", "GSM588177", "GSM588188", "GSM588269", "GSM588145",
"GSM588205", "GSM588162", "GSM588210", "GSM588166", "GSM588125", "GSM588252", "GSM588207", "GSM588173", "GSM588102",
"GSM588286", "GSM588107", "GSM588238", "GSM588189", "GSM588106", "GSM588295", "GSM588192", "GSM588134", "GSM588183",
"GSM588103", "GSM588198", "GSM588293", "GSM588218", "GSM588259", "GSM588234", "GSM588137", "GSM588152", "GSM588133",
"GSM588250", "GSM588168", "GSM588235", "GSM588148", "GSM588279", "GSM588140", "GSM588241", "GSM588111", "GSM588231",
"GSM588128", "GSM588131", "GSM588101", "GSM588088", "GSM588281", "GSM588159", "GSM588249", "GSM588290", "GSM588118",
"GSM588154", "GSM588136", "GSM588268", "GSM588204", "GSM588160", "GSM588135", "GSM588098", "GSM588294", "GSM588225",
"GSM588181", "GSM588248", "GSM588096", "GSM588217", "GSM588147", "GSM588176", "GSM588116", "GSM588146", "GSM588127",
"GSM588104", "GSM588108", "GSM588262", "GSM588223", "GSM588161", "GSM588237", "GSM588172", "GSM588284", "GSM588256",
"GSM588165", "GSM588211", "GSM588242", "GSM588169", "GSM588240", "GSM588264", "GSM588219", "GSM588287", "GSM588124",
"GSM588178", "GSM588167", "GSM588258", "GSM588232", "GSM588141", "GSM588112", "GSM588208", "GSM588215", "GSM588132",
"GSM588278", "GSM588275", "GSM588155", "GSM588153", "GSM588109", "GSM588185", "GSM588138", "GSM588094", "GSM588226",
"GSM588236", "GSM588266", "GSM588119", "GSM588222", "GSM588246", "GSM588150", "GSM588261", "GSM588201", "GSM588247",
"GSM588186", "GSM588280", "GSM588255", "GSM588288", "GSM588260", "GSM588139", "GSM588245", "GSM588270", "GSM588276",
"GSM588199", "GSM588254", "GSM588120", "GSM588144", "GSM588243", "GSM588214", "GSM588180", "GSM588126", "GSM588282",
"GSM588187", "GSM588158", "GSM588193", "GSM588213", "GSM588230", "GSM588263", "GSM588142", "GSM588095"),
stringsAsFactors = F
)
rownames(francesconi_time) <- francesconi_time$geo_accession
To observe the heterochrony of C. elegans soma and germline development, we must can capture tissue-specific age, and thus stage samples with RAPToR while restricting genes to the tissues of interest.
First, we stage with all available genes to get the “Global age”. Then, we restrict genes to the previously mentioned sets to estimate “Soma age” and “Germline age”.
Since PCA or ICA components summarize the expression dynamics of the data, we can evaluate our results by plotting them along the different ages. For example, we can expect that components corresponding to particular tissues will have noticeably cleaner dynamics along their respective age estimates (e.g. oscillatory molting dynamics would be less noisy with soma age than germline age).
dsrockman2010$g <- limma::normalizeBetweenArrays(dsrockman2010$g, method = "quantile")
dsrockman2010$g <- log1p(dsrockman2010$g)
We use a young-adult references for C. elegans to stage the samples.
r_ya <- prepare_refdata("Cel_YA_1", "wormRef", n.inter = 400)
ae_dsrockman2010 <- ae(dsrockman2010$g, r_ya)
We find our estimates match the previously inferred ages by Francesconi and Lehner (2014).
Expression dynamics in the data can be observed through ICA components, and enrichment in their gene loadings (or contributions) allows us to associate them to specific developmental processes.
First, we decide how many ICA components to extract from the number of PCA components needed to explain \(99\%\) of the variance.
pca_rock <- summary(stats::prcomp(t(dsrockman2010$g), rank = 30, center = TRUE, scale = FALSE))
sum(pca_rock$importance["Cumulative Proportion",] < .85) + 1
#> [1] 30
ica_rock <- ica::icafast(t(scale(t(dsrockman2010$g), center = T, scale = F)), nc = 30)
We can plot all components along the estimated age of the samples :
A lot of these components don’t have a definite link to
developmental processes, which is expected given the (intended) genetic
background heterogeneity in the samples. We can have a closer look at
components that clearly capture development, and the enrichment of their
loadings for gene sets of interest.
dev_comps <- 1:6
From the plots above, we can establish that
Now, we stage the samples using only germline or soma gene subsets.
ae_soma <- ae(
dsrockman2010$g[rownames(dsrockman2010$g)%in%gsubset$soma,], # select soma gene subset
r_ya)
ae_germline <- ae(
dsrockman2010$g[rownames(dsrockman2010$g)%in%gsubset$germline,], # select germline gene subset
r_ya)
We notice the soma estimates of two samples (marked with \(\small{\triangle}\)) are quite off from their global or germline age. This is due to both the fact we used a small gene set for inferring age, and that very similar expression profiles can occur at different times in oscillatory profiles (which is the case for molting genes).
If we look at the correlation profiles of one of these samples on the 3 estimates, we can clearly see 2 peaks for the soma:
This is the perfect situation to use a prior. We can input the global age as a prior to favor the correct peak of the soma correlation profile for the erroneous samples.
ae_soma_prior <- ae(
dsrockman2010$g[rownames(dsrockman2010$g)%in%gsubset$soma,], # select soma gene subset
r_ya,
prior = ae_dsrockman2010$age.estimates[,1], # gaussian prior values (mean)
prior.params = 10 # gaussian prior sd
)
This now shifts our estimate to the first peak, while the correlation profile (central black line) is unchanged. (The dotted lines correspond to bootstrap estimates from random gene sets and are thus different).
At the same time, all the other estimates are essentially unchanged.
# 80 & 141 are the indices of the offset samples
mean(ae_soma$age.estimates[-c(80,141), 1] - ae_soma_prior$age.estimates[-c(80,141), 1])
#> [1] -0.014664
Now, we can observe the expression dynamics in the data along tissue-specific estimates.
A expected, components IC1, IC4, and IC5 (previously associated with soma), are much cleaner along the soma age. At the same time, germline-associated components IC2 and IC3 appear quite noisy.
The opposite is true along the germline estimates, with much cleaner dynamics on germline-associated IC2 and IC3, and increased noise for soma oscillatory dynamics in IC1, IC4, and IC5.
Note: In the RAPToR paper (Bulteau and Francesconi (2022)), we used a
different reference to stage the samples (Cel_larv_YA
instead of the presented Cel_YA_1) in order to show a
combination of soma-germline heterochrony between the reference
and the RILs, as well as among the RILs. We also shortened the analysis
for the vignette. Thus, you may note differences with the figures
presented here (e.g. the ICA does not combine reference and
RILs, so components are different; however, enrichment of soma or
germline genes to specific components/dynamics is still clear).